ML CAC

* Name: Shruti Mall

* Register Number: 23122032

* Class: MSc DS A

* Date: 16/05/24

Dataset Description


  • assembly_session: The session number of the assembly. Each session corresponds to one year (session 1 is 1946, session 70 is 2015), and each row of the dataset describes one state in one session.
  • state_code: A numerical code that identifies each state. These codes are likely standardized identifiers.
  • state_name: The name of the state (country) corresponding to the state code.
  • all_votes: The total number of votes the state cast in the session, counting 'yes' votes, 'no' votes, and abstentions.
  • yes_votes: The number of 'yes' votes the state cast in the session, i.e., votes in favor of a motion or proposal.
  • no_votes: The number of 'no' votes the state cast in the session, i.e., votes against a motion or proposal.
  • abstain: The number of times the state abstained in the session, i.e., voted neither in favor of nor against a motion.
  • idealpoint_estimate: A numerical estimate of the state's ideal point in the session. It could be read as a measure of the state's preferred position on the issues voted on.
  • affinityscore_usa: An affinity score between 0 and 1 representing the similarity between the state and the United States. A higher score indicates a stronger perceived alignment with the United States.
  • affinityscore_russia: An affinity score between 0 and 1 representing the similarity between the state and Russia. A higher score indicates a stronger perceived alignment with Russia.
  • affinityscore_china: An affinity score between 0 and 1 representing the similarity between the state and China. A higher score indicates a stronger perceived alignment with China.
  • affinityscore_india: An affinity score between 0 and 1 representing the similarity between the state and India. A higher score indicates a stronger perceived alignment with India.
  • affinityscore_brazil: An affinity score between 0 and 1 representing the similarity between the state and Brazil. A higher score indicates a stronger perceived alignment with Brazil.
  • affinityscore_israel: An affinity score between 0 and 1 representing the similarity between the state and Israel. A higher score indicates a stronger perceived alignment with Israel.

This dataset captures the voting patterns of states across assembly sessions and their affinity with six reference countries. It can be used to study how a state's voting behavior and alignment change over time.
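The relationship between the vote columns can be checked directly. The sketch below is a minimal illustration, assuming that all_votes is meant to equal yes_votes + no_votes + abstain; it uses three rows copied from the df.head() and df.tail() outputs of this notebook rather than the full file, and the names sample, votes_sum, consistent, and yes_share are introduced here only for the example.

```python
import pandas as pd

# Three rows copied from the df.head() / df.tail() outputs in this notebook
sample = pd.DataFrame({
    "state_name": ["United States of America", "United States of America", "Samoa"],
    "year": [1946, 1949, 2015],
    "all_votes": [42, 63, 67],
    "yes_votes": [25, 17, 59],
    "no_votes": [15, 33, 0],
    "abstain": [2, 13, 8],
})

# Assumption being checked: all_votes is the sum of the three vote types
sample["votes_sum"] = sample[["yes_votes", "no_votes", "abstain"]].sum(axis=1)
sample["consistent"] = sample["votes_sum"] == sample["all_votes"]

# Share of 'yes' votes per state and session, a scale-free way to compare rows
sample["yes_share"] = sample["yes_votes"] / sample["all_votes"]

print(sample[["state_name", "year", "all_votes", "votes_sum", "consistent", "yes_share"]])
```

On the full DataFrame the same comparison could be run on df itself to see whether any rows break the assumed relationship.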

Tools and Libraries

  • Pandas: Used for importing the dataset, performing data manipulation, and conducting exploratory analysis.
  • Matplotlib: Employed for creating various types of plots, including bar charts, pie charts, histograms, and line plots.
  • Seaborn: Enhances the visualization aesthetics and provides additional plotting functions for statistical analysis, such as heatmaps.
  • Scikit-learn: Utilized for building a random forest classifier and a K-means clustering model based on the dataset features.
In [ ]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
In [ ]:
df=pd.read_csv("states.csv")
In [ ]:
df.describe(exclude='object')
Out[ ]:
year assembly_session state_code all_votes yes_votes no_votes abstain idealpoint_estimate affinityscore_usa affinityscore_russia affinityscore_china affinityscore_india affinityscore_brazil affinityscore_israel cluster
count 9697.000000 9697.000000 9697.000000 9697.000000 9697.000000 9697.000000 9697.000000 9697.000000 9697.000000 9697.000000 9697.000000 9697.000000 9697.000000 9697.000000 9697.000000
mean 1987.037125 42.037125 446.914613 75.246674 59.793235 5.647932 9.805507 -0.000279 0.293589 0.620304 0.752558 0.687353 0.733821 0.350540 1.061978
std 18.478671 18.478671 258.472010 33.043167 32.657278 8.268681 9.713783 0.989763 0.203660 0.202982 0.160692 0.195748 0.186575 0.189557 0.925201
min 1946.000000 1.000000 2.000000 1.000000 0.000000 0.000000 0.000000 -2.562400 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 1973.000000 28.000000 220.000000 58.000000 38.000000 0.000000 4.000000 -0.661100 0.140800 0.512200 0.723100 0.526300 0.615400 0.193500 0.000000
50% 1989.000000 44.000000 437.000000 68.000000 57.000000 2.000000 7.000000 -0.175500 0.235300 0.652200 0.761200 0.754100 0.800000 0.325900 1.000000
75% 2003.000000 58.000000 660.000000 86.000000 71.000000 9.000000 13.000000 0.808900 0.388100 0.737700 0.869600 0.835800 0.880000 0.466700 2.000000
max 2015.000000 70.000000 990.000000 158.000000 156.000000 98.000000 73.000000 3.004200 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 2.000000
In [ ]:
df.describe(include='object')
Out[ ]:
state_name
count 9697
unique 197
top United States of America
freq 69
In [ ]:
import pandas as pd

# Check unique country names in the DataFrame
unique_countries = df['state_name'].unique()
print(unique_countries)
['United States of America' 'Canada' 'Bahamas' 'Cuba' 'Haiti'
 'Dominican Republic' 'Jamaica' 'Trinidad and Tobago' 'Barbados'
 'Dominica' 'Grenada' 'St. Lucia' 'St. Vincent and the Grenadines'
 'Antigua & Barbuda' 'St. Kitts and Nevis' 'Mexico' 'Belize' 'Guatemala'
 'Honduras' 'El Salvador' 'Nicaragua' 'Costa Rica' 'Panama' 'Colombia'
 'Venezuela' 'Guyana' 'Suriname' 'Ecuador' 'Peru' 'Brazil' 'Bolivia'
 'Paraguay' 'Chile' 'Argentina' 'Uruguay' 'United Kingdom' 'Ireland'
 'Netherlands' 'Belgium' 'Luxembourg' 'France' 'Monaco' 'Liechtenstein'
 'Switzerland' 'Spain' 'Andorra' 'Portugal' 'German Federal Republic'
 'German Democratic Republic' 'Poland' 'Austria' 'Hungary'
 'Czechoslovakia' 'Czech Republic' 'Slovakia' 'Italy' 'San Marino' 'Malta'
 'Albania' 'Montenegro' nan 'Macedonia' 'Croatia' 'Yugoslavia'
 'Bosnia and Herzegovina' 'Slovenia' 'Greece' 'Cyprus' 'Bulgaria'
 'Moldova' 'Romania' 'Russia' 'Estonia' 'Latvia' 'Lithuania' 'Ukraine'
 'Belarus' 'Armenia' 'Georgia' 'Azerbaijan' 'Finland' 'Sweden' 'Norway'
 'Denmark' 'Iceland' 'Cape Verde' 'Sao Tome and Principe' 'Guinea-Bissau'
 'Equatorial Guinea' 'Gambia' 'Mali' 'Senegal' 'Benin' 'Mauritania'
 'Niger' 'Ivory Coast' 'Guinea' 'Burkina Faso' 'Liberia' 'Sierra Leone'
 'Ghana' 'Togo' 'Cameroon' 'Nigeria' 'Gabon' 'Central African Republic'
 'Chad' 'Congo' 'Democratic Republic of the Congo' 'Uganda' 'Kenya'
 'Tanzania' 'Burundi' 'Rwanda' 'Somalia' 'Djibouti' 'Ethiopia' 'Eritrea'
 'Angola' 'Mozambique' 'Zambia' 'Zimbabwe' 'Malawi' 'South Africa'
 'Namibia' 'Lesotho' 'Botswana' 'Swaziland' 'Madagascar' 'Comoros'
 'Mauritius' 'Seychelles' 'Morocco' 'Algeria' 'Tunisia' 'Libya' 'Sudan'
 'South Sudan' 'Iran' 'Turkey' 'Iraq' 'Egypt' 'Syria' 'Lebanon' 'Jordan'
 'Israel' 'Saudi Arabia' 'Yemen Arab Republic' "Yemen People's Republic"
 'Kuwait' 'Bahrain' 'Qatar' 'United Arab Emirates' 'Oman' 'Afghanistan'
 'Turkmenistan' 'Tajikistan' 'Kyrgyzstan' 'Uzbekistan' 'Kazakhstan'
 'China' 'Mongolia' 'Taiwan' 'North Korea' 'South Korea' 'Japan' 'India'
 'Bhutan' 'Pakistan' 'Bangladesh' 'Myanmar' 'Sri Lanka' 'Maldives' 'Nepal'
 'Thailand' 'Cambodia' 'Laos' 'Vietnam' 'Malaysia' 'Singapore' 'Brunei'
 'Philippines' 'Indonesia' 'East Timor' 'Australia' 'Papua New Guinea'
 'New Zealand' 'Vanuatu' 'Solomon Islands' 'Kiribati' 'Tuvalu' 'Fiji'
 'Tonga' 'Nauru' 'Marshall Islands' 'Palau'
 'Federated States of Micronesia' 'Samoa']
In [ ]:
unique_countries_count = df['state_name'].nunique()
print(unique_countries_count)
197
In [ ]:
df.head(20)
Out[ ]:
year assembly_session state_code state_name all_votes yes_votes no_votes abstain idealpoint_estimate affinityscore_usa affinityscore_russia affinityscore_china affinityscore_india affinityscore_brazil affinityscore_israel cluster
0 1946.0 1.0 2 United States of America 42.0 25.0 15.0 2.0 1.7377 1.0 0.2143 0.752558 0.4762 0.6429 0.35054 0
1 1947.0 2.0 2 United States of America 38.0 27.0 10.0 1.0 1.8417 1.0 0.2632 0.752558 0.2973 0.8421 0.35054 0
2 1948.0 3.0 2 United States of America 103.0 46.0 54.0 3.0 1.9909 1.0 0.1275 0.752558 0.3700 0.7767 0.16670 0
3 1949.0 4.0 2 United States of America 63.0 17.0 33.0 13.0 1.9395 1.0 0.1111 0.752558 0.3651 0.5397 0.51610 0
4 1950.0 5.0 2 United States of America 53.0 26.0 25.0 2.0 1.8651 1.0 0.1731 0.752558 0.5094 0.8113 0.60420 0
5 1951.0 6.0 2 United States of America 25.0 10.0 11.0 4.0 1.8919 1.0 0.1200 0.752558 0.3600 0.6400 0.65220 0
6 1952.0 7.0 2 United States of America 49.0 25.0 19.0 5.0 1.9617 1.0 0.1429 0.752558 0.3061 0.6531 0.63270 0
7 1953.0 8.0 2 United States of America 25.0 12.0 6.0 7.0 1.7707 1.0 0.2000 0.752558 0.3333 0.6400 0.52000 0
8 1954.0 9.0 2 United States of America 30.0 18.0 3.0 9.0 1.5565 1.0 0.2000 0.752558 0.3000 0.6333 0.43330 0
9 1955.0 10.0 2 United States of America 27.0 13.0 8.0 6.0 1.8166 1.0 0.1481 0.752558 0.1111 0.7778 0.48150 0
10 1956.0 11.0 2 United States of America 60.0 44.0 11.0 5.0 1.3449 1.0 0.2167 0.752558 0.3667 0.9167 0.58330 0
11 1957.0 12.0 2 United States of America 34.0 22.0 7.0 5.0 1.3156 1.0 0.2353 0.752558 0.4118 0.8235 0.61760 0
12 1958.0 13.0 2 United States of America 33.0 25.0 4.0 4.0 1.3081 1.0 0.3030 0.752558 0.4848 0.8485 0.65630 0
13 1959.0 14.0 2 United States of America 54.0 23.0 17.0 14.0 1.6179 1.0 0.1296 0.752558 0.3396 0.7593 0.70590 0
14 1960.0 15.0 2 United States of America 103.0 53.0 36.0 14.0 1.5740 1.0 0.2330 0.752558 0.2913 0.7184 0.67350 0
15 1961.0 16.0 2 United States of America 73.0 36.0 26.0 11.0 1.7276 1.0 0.1096 0.752558 0.2466 0.6986 0.61640 0
16 1962.0 17.0 2 United States of America 46.0 28.0 13.0 5.0 1.9215 1.0 0.1739 0.752558 0.4889 0.6087 0.52380 0
17 1963.0 18.0 2 United States of America 31.0 15.0 6.0 10.0 1.9040 1.0 0.1290 0.752558 0.3871 0.5484 0.50000 0
18 1965.0 20.0 2 United States of America 40.0 14.0 15.0 11.0 2.0057 1.0 0.2250 0.752558 0.3500 0.5750 0.53850 0
19 1966.0 21.0 2 United States of America 50.0 19.0 20.0 11.0 2.0598 1.0 0.1600 0.752558 0.2200 0.6000 0.68750 0
In [ ]:
df.tail(20)
Out[ ]:
year assembly_session state_code state_name all_votes yes_votes no_votes abstain idealpoint_estimate affinityscore_usa affinityscore_russia affinityscore_china affinityscore_india affinityscore_brazil affinityscore_israel cluster
9685 1996.0 51.0 990 Samoa 74.0 64.0 0.0 10.0 0.2483 0.3378 0.6575 0.6892 0.6216 0.8630 0.2917 2
9686 1997.0 52.0 990 Samoa 66.0 56.0 1.0 9.0 0.2542 0.3485 0.6818 0.6818 0.6212 0.8636 0.2769 2
9687 1998.0 53.0 990 Samoa 57.0 49.0 0.0 8.0 0.1540 0.2456 0.6667 0.6429 0.6140 0.8772 0.2281 2
9688 1999.0 54.0 990 Samoa 62.0 50.0 1.0 11.0 0.2465 0.2742 0.7258 0.5968 0.6129 0.8548 0.2787 2
9689 2000.0 55.0 990 Samoa 62.0 52.0 2.0 8.0 0.2734 0.2742 0.6774 0.6613 0.6129 0.8710 0.2419 2
9690 2001.0 56.0 990 Samoa 37.0 30.0 3.0 4.0 0.2417 0.3784 0.5405 0.5405 0.5135 0.7568 0.3784 0
9691 2002.0 57.0 990 Samoa 62.0 52.0 2.0 8.0 0.1959 0.1613 0.6167 0.6885 0.6290 0.8387 0.1967 2
9692 2003.0 58.0 990 Samoa 67.0 55.0 1.0 11.0 0.1806 0.1493 0.7164 0.6818 0.6269 0.7612 0.1818 2
9693 2004.0 59.0 990 Samoa 65.0 51.0 2.0 12.0 0.2033 0.1692 0.6563 0.6406 0.5692 0.6308 0.1563 2
9694 2005.0 60.0 990 Samoa 71.0 58.0 2.0 11.0 0.1984 0.1714 0.6429 0.7143 0.6479 0.7746 0.2239 2
9695 2006.0 61.0 990 Samoa 80.0 59.0 3.0 18.0 0.2754 0.1875 0.6500 0.6835 0.6076 0.7750 0.2911 2
9696 2007.0 62.0 990 Samoa 66.0 55.0 1.0 10.0 0.1086 0.0909 0.6769 0.7846 0.6970 0.8333 0.2121 2
9697 2008.0 63.0 990 Samoa 69.0 58.0 2.0 9.0 0.1493 0.1449 0.6232 0.6866 0.6377 0.7971 0.2464 2
9698 2009.0 64.0 990 Samoa 64.0 50.0 2.0 12.0 0.1717 0.1719 0.6563 0.7344 0.6406 0.7500 0.1587 2
9699 2010.0 65.0 990 Samoa 65.0 53.0 2.0 10.0 0.2148 0.2000 0.6462 0.7188 0.6462 0.8000 0.2031 2
9700 2011.0 66.0 990 Samoa 59.0 48.0 1.0 10.0 0.2350 0.2881 0.6102 0.6780 0.6271 0.7458 0.2105 2
9701 2012.0 67.0 990 Samoa 68.0 56.0 0.0 12.0 0.2359 0.1912 0.6324 0.6515 0.6618 0.7941 0.1765 2
9702 2013.0 68.0 990 Samoa 62.0 51.0 0.0 11.0 0.1735 0.1935 0.5806 0.7097 0.6452 0.7742 0.1500 2
9703 2014.0 69.0 990 Samoa 75.0 65.0 0.0 10.0 0.1007 0.2344 0.5714 0.6984 0.6615 0.8154 0.2000 2
9704 2015.0 70.0 990 Samoa 67.0 59.0 0.0 8.0 -0.0227 0.2090 0.5672 0.6866 0.6567 0.8507 0.1642 2
In [ ]:
print(df.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9705 entries, 0 to 9704
Data columns (total 15 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   year                  9697 non-null   float64
 1   assembly_session      9697 non-null   float64
 2   state_code            9705 non-null   int64  
 3   state_name            9697 non-null   object 
 4   all_votes             9697 non-null   float64
 5   yes_votes             9697 non-null   float64
 6   no_votes              9697 non-null   float64
 7   abstain               9697 non-null   float64
 8   idealpoint_estimate   9697 non-null   float64
 9   affinityscore_usa     9696 non-null   float64
 10  affinityscore_russia  9692 non-null   float64
 11  affinityscore_china   7608 non-null   float64
 12  affinityscore_india   9696 non-null   float64
 13  affinityscore_brazil  9696 non-null   float64
 14  affinityscore_israel  9585 non-null   float64
dtypes: float64(13), int64(1), object(1)
memory usage: 1.1+ MB
None
In [ ]:
df.isnull().sum()
Out[ ]:
year                       8
assembly_session           8
state_code                 0
state_name                 8
all_votes                  8
yes_votes                  8
no_votes                   8
abstain                    8
idealpoint_estimate        8
affinityscore_usa          9
affinityscore_russia      13
affinityscore_china     2097
affinityscore_india        9
affinityscore_brazil       9
affinityscore_israel     120
dtype: int64
In [ ]:
# Drop rows with null values in the 'state_name' column
df = df.dropna(subset=['state_name'])
In [ ]:
df.isnull().sum()
Out[ ]:
year                       0
assembly_session           0
state_code                 0
state_name                 0
all_votes                  0
yes_votes                  0
no_votes                   0
abstain                    0
idealpoint_estimate        0
affinityscore_usa          1
affinityscore_russia       5
affinityscore_china     2089
affinityscore_india        1
affinityscore_brazil       1
affinityscore_israel     112
dtype: int64
In [ ]:
sns.heatmap(df.isnull(),yticklabels=False,cmap='flare')
Out[ ]:
<Axes: >
In [ ]:
df.shape
Out[ ]:
(9697, 15)
In [ ]:
# Calculate the mean of each column
mean_affinityscore_usa = df['affinityscore_usa'].mean()
mean_affinityscore_russia = df['affinityscore_russia'].mean()
mean_affinityscore_india = df['affinityscore_india'].mean()
mean_affinityscore_brazil = df['affinityscore_brazil'].mean()
mean_affinityscore_israel = df['affinityscore_israel'].mean()
mean_affinityscore_china = df['affinityscore_china'].mean()

# Replace null values with mean of each column using .loc accessor
df.loc[df['affinityscore_usa'].isnull(), 'affinityscore_usa'] = mean_affinityscore_usa
df.loc[df['affinityscore_russia'].isnull(), 'affinityscore_russia'] = mean_affinityscore_russia
df.loc[df['affinityscore_india'].isnull(), 'affinityscore_india'] = mean_affinityscore_india
df.loc[df['affinityscore_brazil'].isnull(), 'affinityscore_brazil'] = mean_affinityscore_brazil
df.loc[df['affinityscore_israel'].isnull(), 'affinityscore_israel'] = mean_affinityscore_israel
df.loc[df['affinityscore_china'].isnull(), 'affinityscore_china'] = mean_affinityscore_china
In [ ]:
import matplotlib.pyplot as plt

# Calculate the mean affinity score for each country
mean_affinity_usa = df['affinityscore_usa'].mean()
mean_affinity_russia = df['affinityscore_russia'].mean()
mean_affinity_china = df['affinityscore_china'].mean()
mean_affinity_india = df['affinityscore_india'].mean()
mean_affinity_brazil = df['affinityscore_brazil'].mean()
mean_affinity_israel = df['affinityscore_israel'].mean()

# Create lists of means and corresponding countries
countries = ['USA', 'Russia', 'China', 'India', 'Brazil', 'Israel']
means = [mean_affinity_usa, mean_affinity_russia, mean_affinity_china, mean_affinity_india, mean_affinity_brazil, mean_affinity_israel]

# Plotting the mean affinity scores
plt.figure(figsize=(10, 6))
plt.bar(countries, means, color=['blue', 'violet', 'red', 'orange', 'purple', 'pink'])
plt.xlabel('Country')
plt.ylabel('Mean Affinity Score')
plt.title('Mean Affinity Score Comparison of Countries')
plt.grid(axis='y')
plt.show()
  • The graph compares the mean affinity scores for six countries: USA, Russia, China, India, Brazil, and Israel.

  • The x-axis lists the countries in the order given in the code (USA, Russia, China, India, Brazil, Israel). The y-axis shows the mean affinity score, which ranges from about 0.29 to about 0.75.

  • China (about 0.75) and Brazil (about 0.73) have the highest mean affinity scores, followed by India (about 0.69) and Russia (about 0.62). The United States (about 0.29) and Israel (about 0.35) have the lowest.
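Reading values off a bar chart is error-prone, so the ranking can also be checked numerically. The sketch below is illustrative only: it uses the six means that this notebook prints in its Highcharts cell output, rounded to four decimals, and the names mean_affinity, ranked, highest_two, and lowest_two are introduced here for the example.

```python
import pandas as pd

# Mean affinity scores as printed in this notebook's Highcharts output (rounded)
mean_affinity = pd.Series({
    "USA": 0.2936,
    "Russia": 0.6203,
    "China": 0.7526,
    "India": 0.6874,
    "Brazil": 0.7338,
    "Israel": 0.3505,
})

# Sort from highest to lowest mean affinity
ranked = mean_affinity.sort_values(ascending=False)
print(ranked)

highest_two = list(ranked.index[:2])
lowest_two = list(ranked.index[-2:])
print("Highest:", highest_two)
print("Lowest:", lowest_two)

# On the full data, the same ranking could be computed with:
# df.filter(like="affinityscore_").mean().sort_values(ascending=False)
```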

In [ ]:
df.isnull().sum()
Out[ ]:
year                    0
assembly_session        0
state_code              0
state_name              0
all_votes               0
yes_votes               0
no_votes                0
abstain                 0
idealpoint_estimate     0
affinityscore_usa       0
affinityscore_russia    0
affinityscore_china     0
affinityscore_india     0
affinityscore_brazil    0
affinityscore_israel    0
dtype: int64
In [ ]:
df.describe()
Out[ ]:
year assembly_session state_code all_votes yes_votes no_votes abstain idealpoint_estimate affinityscore_usa affinityscore_russia affinityscore_china affinityscore_india affinityscore_brazil affinityscore_israel
count 9697.000000 9697.000000 9697.000000 9697.000000 9697.000000 9697.000000 9697.000000 9697.000000 9697.000000 9697.000000 9697.000000 9697.000000 9697.000000 9697.000000
mean 1987.037125 42.037125 446.914613 75.246674 59.793235 5.647932 9.805507 -0.000279 0.293589 0.620304 0.752558 0.687353 0.733821 0.350540
std 18.478671 18.478671 258.472010 33.043167 32.657278 8.268681 9.713783 0.989763 0.203660 0.202982 0.160692 0.195748 0.186575 0.189557
min 1946.000000 1.000000 2.000000 1.000000 0.000000 0.000000 0.000000 -2.562400 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 1973.000000 28.000000 220.000000 58.000000 38.000000 0.000000 4.000000 -0.661100 0.140800 0.512200 0.723100 0.526300 0.615400 0.193500
50% 1989.000000 44.000000 437.000000 68.000000 57.000000 2.000000 7.000000 -0.175500 0.235300 0.652200 0.761200 0.754100 0.800000 0.325900
75% 2003.000000 58.000000 660.000000 86.000000 71.000000 9.000000 13.000000 0.808900 0.388100 0.737700 0.869600 0.835800 0.880000 0.466700
max 2015.000000 70.000000 990.000000 158.000000 156.000000 98.000000 73.000000 3.004200 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
In [ ]:
df.nunique()
Out[ ]:
year                      69
assembly_session          69
state_code               198
state_name               197
all_votes                158
yes_votes                157
no_votes                  70
abstain                   71
idealpoint_estimate     8377
affinityscore_usa       2257
affinityscore_russia    2411
affinityscore_china     1687
affinityscore_india     2276
affinityscore_brazil    2120
affinityscore_israel    2023
dtype: int64
In [ ]:
import matplotlib.pyplot as plt
# Histograms for each numerical feature
df.hist(bins=20, figsize=(15, 10))
plt.show()
In [ ]:
countries = ['USA', 'Russia', 'China', 'India', 'Brazil', 'Israel']
means = [mean_affinity_usa, mean_affinity_russia, mean_affinity_china, mean_affinity_india, mean_affinity_brazil, mean_affinity_israel]

# Plotting the mean affinity scores as a pie chart
plt.figure(figsize=(8, 8))
plt.pie(means, labels=countries, autopct='%1.1f%%', startangle=140)
plt.title('Mean Affinity Score Comparison of Countries')
plt.show()
In [ ]:
import pandas as pd
import matplotlib.pyplot as plt

# Assuming you have loaded your dataset into a DataFrame named df
# Let's group the DataFrame by 'state_name' and 'assembly_session' and calculate the sum of votes
grouped_df = df.groupby(['state_name', 'assembly_session']).sum().reset_index()

# Stacked Bar Plot for Yes, No, and Abstain votes
plt.figure(figsize=(20, 15))
for country, data in grouped_df.groupby('state_name'):
    plt.bar(data['assembly_session'], data['yes_votes'], label=f'{country} Yes', alpha=0.7)
    plt.bar(data['assembly_session'], data['no_votes'], bottom=data['yes_votes'], label=f'{country} No', alpha=0.7)
    plt.bar(data['assembly_session'], data['abstain'], bottom=data['yes_votes']+data['no_votes'], label=f'{country} Abstain', alpha=0.7)

plt.xlabel('Assembly Session')
plt.ylabel('Votes')
plt.title('Votes by Country and Assembly Session')
plt.legend()
plt.show()
/Users/shrutimall/.local/pipx/.cache/5c9468f9a0a782a/lib/python3.12/site-packages/IPython/core/pylabtools.py:170: UserWarning: Creating legend with loc="best" can be slow with large amounts of data.
  fig.canvas.print_figure(bytes_io, **kw)
In [ ]:
import matplotlib.pyplot as plt

# Assuming you have already defined grouped_df correctly
plt.figure(figsize=(15, 10))
for country, data in grouped_df.groupby('state_name'):
    plt.fill_between(data['assembly_session'], data['yes_votes'], label=f'{country} Yes', alpha=0.7)
    plt.fill_between(data['assembly_session'], data['no_votes'], label=f'{country} No', alpha=0.7)
    plt.fill_between(data['assembly_session'], data['abstain'], label=f'{country} Abstain', alpha=0.7)

plt.xlabel('Assembly Session')
plt.ylabel('Votes')
plt.title('Votes by Country and Assembly Session')
plt.legend()
plt.show()
In [ ]:
# Create a pivot table for heatmap
heatmap_df = grouped_df.pivot(index='assembly_session', columns='state_name', values=['yes_votes', 'no_votes', 'abstain'])

# Heatmap
plt.figure(figsize=(12, 8))
sns.heatmap(heatmap_df, cmap='viridis', linewidths=0.5)
plt.title('Votes by Country and Assembly Session')
plt.xlabel('Country')
plt.ylabel('Assembly Session')
plt.show()
In [ ]:
import pandas as pd
import matplotlib.pyplot as plt

# Assuming your DataFrame is named df

# Filter data for the desired countries
desired_countries = ['Brazil', 'China', 'United States of America', 'Israel', 'Russia', 'India']
filtered_df = df[df['state_name'].isin(desired_countries)]

# Group by assembly session and calculate sum of yes, no, and abstain votes
grouped_df = filtered_df.groupby(['assembly_session', 'state_name']).sum().reset_index()

# Plotting
fig, axes = plt.subplots(nrows=3, ncols=1, figsize=(12, 18))  # Adjusted figsize
fig.patch.set_facecolor('#f5f5f5')  # Set background color for the figure
fig.patch.set_linewidth(1)  # Set border width for the figure
fig.patch.set_edgecolor('black')  # Set border color for the figure

# Set background color for the plot area
for ax in axes:
    ax.set_facecolor('#e5e5e5')  # Set background color for the plot area
    ax.spines['top'].set_visible(True)  # Show top spine/border
    ax.spines['right'].set_visible(True)  # Show right spine/border
    ax.spines['bottom'].set_linewidth(1)  # Set border width for the bottom spine
    ax.spines['left'].set_linewidth(1)  # Set border width for the left spine
    ax.spines['top'].set_linewidth(1)  # Set border width for the top spine
    ax.spines['right'].set_linewidth(1)  # Set border width for the right spine

# Plot for Yes votes
axes[0].set_title('Yes Votes by Country for Each Assembly Session', fontsize=16, fontweight='bold', color='blue')
for country, data in grouped_df.groupby('state_name'):
    axes[0].plot(data['assembly_session'], data['yes_votes'], label=country, linewidth=2)
axes[0].set_xlabel('Assembly Session', fontsize=14)
axes[0].set_ylabel('Yes Votes', fontsize=14)
axes[0].legend(fontsize=12)
axes[0].grid(True, linestyle='--', alpha=0.5)  # Add grid lines

# Plot for No votes
axes[1].set_title('No Votes by Country for Each Assembly Session', fontsize=16, fontweight='bold', color='red')
for country, data in grouped_df.groupby('state_name'):
    axes[1].plot(data['assembly_session'], data['no_votes'], label=country, linewidth=2)
axes[1].set_xlabel('Assembly Session', fontsize=14)
axes[1].set_ylabel('No Votes', fontsize=14)
axes[1].legend(fontsize=12)
axes[1].grid(True, linestyle='--', alpha=0.5)  # Add grid lines

# Plot for Abstain votes
axes[2].set_title('Abstain Votes by Country for Each Assembly Session', fontsize=16, fontweight='bold', color='green')
for country, data in grouped_df.groupby('state_name'):
    axes[2].plot(data['assembly_session'], data['abstain'], label=country, linewidth=2)
axes[2].set_xlabel('Assembly Session', fontsize=14)
axes[2].set_ylabel('Abstain Votes', fontsize=14)
axes[2].legend(fontsize=12)
axes[2].grid(True, linestyle='--', alpha=0.5)  # Add grid lines

plt.tight_layout()
plt.show()
In [ ]:
# Calculate the mean affinity score for each country
mean_affinity_usa = df['affinityscore_usa'].mean()
mean_affinity_russia = df['affinityscore_russia'].mean()
mean_affinity_china = df['affinityscore_china'].mean()
mean_affinity_india = df['affinityscore_india'].mean()
mean_affinity_brazil = df['affinityscore_brazil'].mean()
mean_affinity_israel = df['affinityscore_israel'].mean()

# Create lists of means and corresponding countries
countries = ['USA', 'Russia', 'China', 'India', 'Brazil', 'Israel']
means = [mean_affinity_usa, mean_affinity_russia, mean_affinity_china, mean_affinity_india, mean_affinity_brazil, mean_affinity_israel]

# Create the Highcharts HTML code
highcharts_html = """
<!DOCTYPE html>
<html>
<head>
  <title>Mean Affinity Score Comparison of Countries</title>
  <script src="https://code.highcharts.com/highcharts.js"></script>
</head>
<body>

<div id="container" style="width: 600px; height: 400px; margin: 0 auto"></div>

<script>
Highcharts.chart('container', {{
  chart: {{
    type: 'bar'
  }},
  title: {{
    text: 'Mean Affinity Score Comparison of Countries'
  }},
  xAxis: {{
    categories: {categories}
  }},
  yAxis: {{
    title: {{
      text: 'Mean Affinity Score'
    }}
  }},
  series: [{{
    name: 'Mean Affinity Score',
    data: {means}
  }}]
}});
</script>

</body>
</html>
""".format(categories=countries, means=means)


# Save the HTML output to a file or display it
print(highcharts_html)
<!DOCTYPE html>
<html>
<head>
  <title>Mean Affinity Score Comparison of Countries</title>
  <script src="https://code.highcharts.com/highcharts.js"></script>
</head>
<body>

<div id="container" style="width: 600px; height: 400px; margin: 0 auto"></div>

<script>
Highcharts.chart('container', {
  chart: {
    type: 'bar'
  },
  title: {
    text: 'Mean Affinity Score Comparison of Countries'
  },
  xAxis: {
    categories: ['USA', 'Russia', 'China', 'India', 'Brazil', 'Israel']
  },
  yAxis: {
    title: {
      text: 'Mean Affinity Score'
    }
  },
  series: [{
    name: 'Mean Affinity Score',
    data: [0.29358908828382835, 0.620303931077177, 0.7525578338590958, 0.6873531043729373, 0.7338206373762376, 0.3505397496087637]
  }]
});
</script>

</body>
</html>

In [ ]:
html_content = """
<!DOCTYPE html>
<html>
<head>
  <title>Mean Affinity Score Comparison of Countries</title>
  <script src="https://code.highcharts.com/highcharts.js"></script>
</head>
<body>

<div id="container" style="width: 600px; height: 400px; margin: 0 auto"></div>

<script>
Highcharts.chart('container', {
  chart: {
    type: 'bar'
  },
  title: {
    text: 'Mean Affinity Score Comparison of Countries'
  },
  xAxis: {
    categories: ['USA', 'Russia', 'China', 'India', 'Brazil', 'Israel']
  },
  yAxis: {
    title: {
      text: 'Mean Affinity Score'
    }
  },
  series: [{
    name: 'Mean Affinity Score',
    data: [0.29358908828382835, 0.620303931077177, 0.7525578338590958, 0.6873531043729373, 0.7338206373762376, 0.3505397496087637]
  }]
});
</script>

</body>
</html>
"""

# Write the HTML content to a file
with open('affinity_score_comparison.html', 'w') as file:
    file.write(html_content)

print("HTML file saved successfully!")
HTML file saved successfully!
In [ ]:
import pandas as pd
import matplotlib.pyplot as plt

# Assuming 'df' is your DataFrame with columns 'yes_votes', 'no_votes', and 'abstain'

# Step 1: Calculate the sum of each column
sum_yes_votes = df['yes_votes'].sum()
sum_no_votes = df['no_votes'].sum()
sum_abstain = df['abstain'].sum()

# Step 2: Add up the sums of the three columns to get the total
total_votes = sum_yes_votes + sum_no_votes + sum_abstain

# Step 3: Calculate the percentage of each sum
yes_percentage = (sum_yes_votes / total_votes) * 100
no_percentage = (sum_no_votes / total_votes) * 100
abstain_percentage = (sum_abstain / total_votes) * 100

# Step 4: Create a pie chart
labels = ['Yes', 'No', 'Abstain']
sizes = [yes_percentage, no_percentage, abstain_percentage]
colors = ['green', 'red', 'yellow']
explode = (0.1, 0, 0)  # explode 1st slice (optional)

plt.pie(sizes, explode=explode, labels=labels, colors=colors, autopct='%1.1f%%', shadow=True, startangle=140)
plt.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle.
plt.title('Total Votes Distribution')
plt.show()
In [ ]:
# Time Series Analysis
plt.figure(figsize=(10, 6))
sns.lineplot(x='year', y='all_votes', data=df)
plt.title('Trend of All Votes over Years')
plt.xlabel('Year')
plt.ylabel('All Votes')
plt.show()
In [ ]:
# Correlation Analysis
# correlation_matrix = df.corr()
# plt.figure(figsize=(10, 8))
# sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
# plt.title('Correlation Matrix')
# plt.show()
numeric_columns = df.select_dtypes(include=['float64', 'int64']).columns
numeric_data = df[numeric_columns]

# Visualize correlation matrix
correlation_matrix = numeric_data.corr()
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()
In [ ]:
# Distribution Analysis
plt.figure(figsize=(12, 6))
sns.histplot(df['idealpoint_estimate'], kde=True, bins=30)
plt.title('Distribution of Ideal Point Estimate')
plt.xlabel('Ideal Point Estimate')
plt.ylabel('Frequency')
plt.show()

Random Forest Classifier

  • A random forest classifier builds a set of decision trees (DTs), each trained on a randomly selected subset of the training set. It then collects the votes from the individual trees to decide the final prediction.
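The voting step can be made concrete with a small sketch. It assumes scikit-learn's behavior of averaging the class probabilities predicted by the individual trees and choosing the class with the highest average, which is a probability-weighted form of voting. The data is synthetic and the names X_demo, y_demo, forest, and manual_pred are introduced only for this illustration; none of them come from the notebook's dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic two-class data (an assumption for illustration, not the notebook's data)
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)

# An odd number of trees avoids tied votes in a two-class problem
forest = RandomForestClassifier(n_estimators=25, random_state=0)
forest.fit(X_demo, y_demo)

# Each fitted tree reports class probabilities for every sample
tree_probas = np.stack([tree.predict_proba(X_demo) for tree in forest.estimators_])

# Average over the trees, then pick the class with the highest mean probability
mean_proba = tree_probas.mean(axis=0)
manual_pred = forest.classes_[mean_proba.argmax(axis=1)]

# Fraction of samples where the manual vote matches the forest's own prediction
agreement = (manual_pred == forest.predict(X_demo)).mean()
print("Agreement with forest.predict:", agreement)
```

The individual trees are available through the fitted model's estimators_ attribute, which makes it possible to inspect how much the trees disagree on a given sample.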
In [ ]:
# import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Load the data (replace 'data.csv' with your actual file path)
# data = pd.read_csv('data.csv')

# Prepare data
X = df[['affinityscore_usa', 'affinityscore_russia',
          'affinityscore_china', 'affinityscore_india', 
          'affinityscore_brazil', 'affinityscore_israel']]
yes_mean = df['yes_votes'].mean()  # compute the threshold once instead of once per row
y = df['yes_votes'].apply(lambda x: 'high' if x > yes_mean else 'low')  # Target variable

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and train the model
clf = RandomForestClassifier()
clf.fit(X_train, y_train)

# Predictions
y_pred = clf.predict(X_test)


# Model Evaluation
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
print("Classification Report:")
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
Accuracy: 0.9010309278350516
Classification Report:
              precision    recall  f1-score   support

        high       0.86      0.92      0.89       840
         low       0.94      0.89      0.91      1100

    accuracy                           0.90      1940
   macro avg       0.90      0.90      0.90      1940
weighted avg       0.90      0.90      0.90      1940

[[773  67]
 [125 975]]
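
The figures in the classification report can be recomputed by hand from the confusion matrix printed above. The sketch below hard-codes that matrix; it assumes scikit-learn's default layout, where rows are true labels and columns are predicted labels, in sorted label order (high, low).

```python
# Confusion matrix copied from the output above.
# Rows = true class, columns = predicted class, order: ['high', 'low']
cm = [[773, 67],
      [125, 975]]

total = sum(sum(row) for row in cm)
accuracy = (cm[0][0] + cm[1][1]) / total

# 'high' class: precision uses the first column, recall uses the first row
precision_high = cm[0][0] / (cm[0][0] + cm[1][0])
recall_high = cm[0][0] / (cm[0][0] + cm[0][1])

# 'low' class: precision uses the second column, recall uses the second row
precision_low = cm[1][1] / (cm[1][1] + cm[0][1])
recall_low = cm[1][1] / (cm[1][1] + cm[1][0])

print(f"accuracy        = {accuracy:.4f}")
print(f"high: precision = {precision_high:.2f}, recall = {recall_high:.2f}")
print(f"low:  precision = {precision_low:.2f}, recall = {recall_low:.2f}")
```

Under that layout, the off-diagonal entries say that 67 test rows labelled 'high' were predicted 'low', and 125 rows labelled 'low' were predicted 'high'.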
In [ ]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import classification_report

# Generate classification report
report = classification_report(y_test, y_pred, output_dict=True)

# Convert the classification report to a DataFrame
report_df = pd.DataFrame(report).transpose()

# Plot the heatmap
plt.figure(figsize=(10, 6))
sns.heatmap(report_df.iloc[:-1, :-1], annot=True, cmap="YlGnBu", fmt=".2f")
plt.title('Classification Report Heatmap')
plt.xlabel('Metrics')
plt.ylabel('Class')
plt.show()

K Means¶

  • Unsupervised machine learning trains an algorithm on unlabeled, unclassified data and lets it operate on that data without supervision. With no labeled examples to learn from, the algorithm's job is to organize unsorted data according to similarities, patterns, and differences.
  • K-means is one such algorithm: it partitions the observations into k clusters by assigning each observation to its nearest cluster centroid.
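
As a rough sketch of how K-means itself works, the loop below alternates between assigning points to the nearest centroid and moving each centroid to the mean of its points. It uses plain NumPy on made-up 2-D points; the function name `kmeans_sketch` and the data are illustrative assumptions, and scikit-learn's `KMeans` adds smarter initialisation and stopping rules.

```python
import numpy as np

def kmeans_sketch(points, k, n_iter=20, seed=0):
    """Minimal Lloyd-style K-means; illustrative only."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: index of the nearest centroid for every point
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned points
        # (a centroid with no points keeps its previous position)
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = points[labels == j].mean(axis=0)
    return labels, centroids

# Two well-separated blobs of made-up points
rng = np.random.default_rng(1)
blob_a = rng.normal(loc=[0.0, 0.0], scale=0.3, size=(50, 2))
blob_b = rng.normal(loc=[5.0, 5.0], scale=0.3, size=(50, 2))
points = np.vstack([blob_a, blob_b])

labels, centroids = kmeans_sketch(points, k=2)
print(centroids.round(2))
```

Because the distances are Euclidean, features on very different scales would dominate the assignment step; the affinity scores used below share a common scale, so the sketch does not standardise anything.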
In [ ]:
import pandas as pd
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import seaborn as sns



# Prepare data
X = df[['affinityscore_usa', 'affinityscore_russia',
          'affinityscore_china', 'affinityscore_india', 
          'affinityscore_brazil', 'affinityscore_israel']]

# Initialize and fit K-means model
kmeans = KMeans(n_clusters=3, random_state=42)
kmeans.fit(X)

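# Add cluster labels to the DataFrame; assign() returns a new DataFrame,
# which avoids the SettingWithCopyWarning raised by df['cluster'] = ...
df = df.assign(cluster=kmeans.labels_)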

# Visualize clusters
plt.figure(figsize=(10, 6))
sns.scatterplot(x='yes_votes', y='no_votes', hue='cluster', data=df, palette='Set1', legend='full')
plt.title('Clustering of States based on Voting Behavior')
plt.xlabel('Yes Votes')
plt.ylabel('No Votes')
plt.show()
In [ ]:
from sklearn.metrics import silhouette_score

# Calculate silhouette score
silhouette_avg = silhouette_score(X, kmeans.labels_)
print("Silhouette Score:", silhouette_avg)
Silhouette Score: 0.40464240364072335
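
The silhouette score ranges from -1 to 1, and values closer to 1 indicate tighter, better-separated clusters. One common use is to compare several values of k. A minimal sketch on synthetic blobs follows; the data, the candidate range 2 to 5, and the variable names are assumptions for illustration, not results from this dataset.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Three well-separated synthetic blobs, used only for illustration
X_demo, _ = make_blobs(n_samples=300, centers=[[0, 0], [6, 6], [0, 6]],
                       cluster_std=0.5, random_state=0)

# Fit K-means for several k and record the silhouette score of each
scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_demo)
    scores[k] = silhouette_score(X_demo, labels)

best_k = max(scores, key=scores.get)
print("Best k by silhouette:", best_k)
```

Applied to this notebook, the same loop would run over the affinity-score matrix `X` instead of `X_demo`.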
In [ ]:
import pandas as pd
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import seaborn as sns

# Prepare data
X = df[['affinityscore_usa', 'affinityscore_russia',
          'affinityscore_china', 'affinityscore_india', 
          'affinityscore_brazil', 'affinityscore_israel']]

# Initialize and fit K-means model
kmeans = KMeans(n_clusters=3, random_state=42)
kmeans.fit(X)

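# Add cluster labels to the DataFrame; assign() returns a new DataFrame,
# which avoids the SettingWithCopyWarning raised by df['cluster'] = ...
df = df.assign(cluster=kmeans.labels_)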

# Visualize clusters
plt.figure(figsize=(10, 6))
sns.scatterplot(x='affinityscore_india', y='affinityscore_china', hue='cluster', data=df, palette='Set1', legend='full')
plt.title('Clustering of States based on Affinity Scores')
plt.xlabel('Affinity Score - India')
plt.ylabel('Affinity Score - China')
plt.show()
In [ ]:
# Group the data by state and sum the votes in each category.
# 'all_votes' is left out of the stack: per the dataset description it already
# includes the yes, no, and abstain counts, so stacking it would double-count them.
state_votes = df.groupby('state_name')[['yes_votes', 'no_votes', 'abstain']].sum()

# Stacked bar chart with one bar per state
state_votes.plot(kind='bar', stacked=True, figsize=(50, 20))
plt.title('Voting Patterns Across Different States')
plt.xlabel('State')
plt.ylabel('Number of Votes')
plt.xticks(rotation=45)  # Rotate state names for better readability
plt.legend(title='Vote Category')
plt.tight_layout()
plt.show()
In [ ]:
# Calculate the sum of votes for each category
vote_distribution = df[['all_votes', 'yes_votes', 'no_votes', 'abstain']].sum()

# Plotting bar chart for vote distribution
plt.figure(figsize=(10, 6))
vote_distribution.plot(kind='bar', color=['blue', 'green', 'red', 'orange'])
plt.title('Vote Distribution')
plt.xlabel('Vote Type')
plt.ylabel('Number of Votes')
plt.xticks(rotation=45)  # Rotate x-axis labels for better readability
plt.show()
In [ ]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score



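# Data preprocessing
# Target: state_name. Features: every other column in df.
# Caution: identifier-like columns (for example state_code, if it is still in df)
# map directly to the state and can leak the target into the features.
X = df.drop(columns=['state_name'])  # Features
y = df['state_name']  # Target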

# Encode categorical variables
le = LabelEncoder()
y = le.fit_transform(y)

# Splitting the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Model training
model = RandomForestClassifier()
model.fit(X_train, y_train)

# Model evaluation
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
Accuracy: 0.5252577319587629
In [ ]:
import pandas as pd
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# Data preprocessing
# Cluster on every column except state_name
X = df.drop(columns=['state_name'])

# Model training
kmeans = KMeans(n_clusters=3, random_state=42)  # fixed seed so the clusters are reproducible
kmeans.fit(X)

# Visualizing clusters
plt.scatter(X['no_votes'], X['yes_votes'], c=kmeans.labels_, cmap='viridis')
plt.xlabel('No Votes')
plt.ylabel('Yes Votes')
plt.title('Clustering of States based on Voting Patterns')
plt.show()

Linear Regression¶

  • Linear regression predicts the value of one variable from the value of another. The variable being predicted is called the dependent variable; the variable used to predict it is called the independent variable.
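
For a single predictor, the fitted line `y = b0 + b1 * x` has a closed-form least-squares solution. A minimal sketch on synthetic data follows; the numbers and variable names are made up for illustration, and `LinearRegression` in the next cell computes the same kind of fit on the affinity scores.

```python
import numpy as np

# Synthetic data: y depends linearly on x, plus noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=200)
y = 0.5 + 2.0 * x + rng.normal(scale=0.1, size=200)

# Least-squares estimates for one predictor:
#   slope     = cov(x, y) / var(x)
#   intercept = mean(y) - slope * mean(x)
slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
intercept = y.mean() - slope * x.mean()

# Mean squared error of the fitted line on the same data
y_hat = intercept + slope * x
mse = np.mean((y - y_hat) ** 2)

print(f"slope={slope:.3f}, intercept={intercept:.3f}, mse={mse:.4f}")
```

The mean squared error is in squared units of the target, so its square root is often easier to compare against the range of the target variable.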
In [ ]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Assuming df is your DataFrame containing the data

# Data preprocessing
X = df['affinityscore_china']  # Features
y = df['affinityscore_india']  # Target

# Reshape X and y to be two-dimensional arrays
X = X.values.reshape(-1, 1)  # Reshape X to a column vector
y = y.values.reshape(-1, 1)  # Reshape y to a column vector

# Splitting the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Model training
model = LinearRegression()
model.fit(X_train, y_train)

# Model evaluation
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error:", mse)
Mean Squared Error: 0.015783713134011237
In [ ]:
import matplotlib.pyplot as plt

# Plotting the relationship between the independent variable (X) and the dependent variable (y)
plt.figure(figsize=(8, 6))
plt.scatter(X, y, color='blue', alpha=0.5)
plt.title('Scatter Plot of affinityscore_china vs. affinityscore_india')
plt.xlabel('affinityscore_china')
plt.ylabel('affinityscore_india')
plt.grid(True)
plt.show()
In [ ]:
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.arima.model import ARIMA

# 'year' is the time variable and 'all_votes' is the target variable.
# Take an explicit copy so that set_index does not modify a slice of df.
ts_data = df[['year', 'all_votes']].copy()
ts_data = ts_data.set_index('year')

# Model training
model = ARIMA(ts_data, order=(5,1,0))
fit_model = model.fit()

# Forecasting: 25 steps, labelled 2025 to 2049 as plain integer years so that
# they share an x-axis with the integer 'year' index of ts_data
forecast_index = list(range(2025, 2050))
forecast = fit_model.forecast(steps=len(forecast_index))

# Plotting forecast
plt.plot(ts_data, label='Actual')
plt.plot(forecast_index, forecast, label='Forecast')
plt.title('Forecasting Future Voting Patterns')
plt.xlabel('Year')
plt.ylabel('Number of Votes')
plt.legend()
plt.show()
In [ ]:
# Calculate residuals
residuals = y_test - y_pred

# Plot residuals
plt.figure(figsize=(8, 6))
plt.scatter(y_pred, residuals, color='blue', alpha=0.5)
plt.axhline(y=0, color='red', linestyle='--')
plt.title('Residuals Plot')
plt.xlabel('Predicted Values')
plt.ylabel('Residuals')
plt.show()

ARIMA (AutoRegressive Integrated Moving Average)¶

  • An autoregressive integrated moving average, or ARIMA, is a statistical analysis model that uses time series data to either better understand the data set or to predict future trends.

  • A statistical model is autoregressive if it predicts future values based on past values. For example, an ARIMA model might seek to predict a stock's future prices based on its past performance or forecast a company's earnings based on past periods.
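
The "autoregressive" part can be illustrated without any time series library. The sketch below simulates a first-order autoregressive series, `y[t] = phi * y[t-1] + noise`, and recovers `phi` by least squares. The series, the value `phi = 0.7`, and all names are assumptions made up for the example. A full ARIMA model adds differencing (the "I") and moving-average terms (the "MA") on top of this idea.

```python
import numpy as np

rng = np.random.default_rng(42)
phi_true = 0.7   # assumed coefficient for the simulation
n = 2000

# Simulate y[t] = phi * y[t-1] + noise
y = np.zeros(n)
for t in range(1, n):
    y[t] = phi_true * y[t - 1] + rng.normal()

# Regress y[t] on y[t-1] (no intercept) to estimate phi
prev, curr = y[:-1], y[1:]
phi_hat = np.dot(prev, curr) / np.dot(prev, prev)

# One-step-ahead forecast from the last observed value
next_forecast = phi_hat * y[-1]

print(f"estimated phi = {phi_hat:.3f}")
```

Repeating the one-step forecast shrinks it by a factor of `phi_hat` each time, so long-horizon forecasts from a stationary autoregressive model settle toward a constant level.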

In [ ]:
import numpy as np
from sklearn.metrics import mean_squared_error
from statsmodels.tsa.arima.model import ARIMA

# Splitting the dataset into training and testing sets
train_size = int(len(target_data) * 0.8)  # Using 80% of the data for training
train_data, test_data = target_data.iloc[:train_size], target_data.iloc[train_size:]

# Model training
model = ARIMA(train_data, order=(5, 1, 0))
fit_model = model.fit()

# Forecasting
forecast = fit_model.forecast(steps=len(test_data))  # Forecasting on the testing data

# Calculate Mean Squared Error (MSE)
mse = mean_squared_error(test_data, forecast)

# Calculate Root Mean Squared Error (RMSE)
rmse = np.sqrt(mse)

print("Mean Squared Error (MSE):", mse)
print("Root Mean Squared Error (RMSE):", rmse)

# Print the forecast
print("Forecast:")
print(forecast)
Mean Squared Error (MSE): 1145.1768315695354
Root Mean Squared Error (RMSE): 33.84046145621444
Forecast:
7757    64.542714
7758    64.554631
7759    65.089764
7760    64.674080
7761    64.790490
          ...    
9692    64.779412
9693    64.779412
9694    64.779412
9695    64.779412
9696    64.779412
Name: predicted_mean, Length: 1940, dtype: float64